Download auto data from the Statistical Learning book website here: http://www-bcf.usc.edu/~gareth/ISL/data.html
Today, we are going over hierarchical clustering, k-means clustering, PCA, and ICA.
# read in Auto data
Auto_data <- read_csv("~/Desktop/MS/fall/machine learning/assignments/hw1-taitea/Auto.csv")
## Parsed with column specification:
## cols(
## mpg = col_double(),
## cylinders = col_double(),
## displacement = col_double(),
## horsepower = col_character(),
## weight = col_double(),
## acceleration = col_double(),
## year = col_double(),
## origin = col_double(),
## name = col_character()
## )
#remove cars with unknown horsepower and set horsepower to numeric
Auto_data <- Auto_data %>%
filter(horsepower != "?") %>%
mutate(horsepower = as.numeric(horsepower)) %>%
as.data.frame()
#save car names
Auto_data_names <- Auto_data$name
#data to cluster
Auto_data_clust <- Auto_data[,1:8]
dim(Auto_data_clust)
## [1] 392 8
#392 cars are too many for a demo, so let's take the first 25
Auto_data_clust <- Auto_data_clust[1:25,]
rownames(Auto_data_clust) <- Auto_data_names[1:25]
Step 1. Assign each item to its own cluster. We start with 25 clusters, one for each car.
Step 2. Calculate a proximity matrix between each cluster.
Step 3. Find the pair of clusters closest to each other.
Step 4. Merge these clusters and then recalculate the distances between clusters. Some options are: single linkage (distance between the nearest members of the two clusters), complete linkage (distance between the farthest members), and average linkage (mean of all pairwise distances between members of the two clusters).
Step 5. Repeat Steps 3 and 4 until there is only one cluster.
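To make the five steps concrete before running them on the Auto data, here is a minimal, self-contained sketch of the agglomerative loop with average linkage on a hypothetical 4-point dataset (illustrative only; hclust below is the real implementation):

```r
#toy data: two tight pairs of points
pts <- matrix(c(0, 0,  0, 1,  5, 5,  5, 6), ncol = 2, byrow = TRUE)
d <- as.matrix(dist(pts))                # Step 2: proximity matrix
diag(d) <- NA                            # ignore self-distances
merges <- character(0)
while (nrow(d) > 1) {
  ij <- arrayInd(which.min(d), dim(d))   # Step 3: closest pair of clusters
  i <- ij[1]; j <- ij[2]
  merges <- c(merges, paste(rownames(d)[i], rownames(d)[j], sep = "+"))
  # Step 4: average linkage -- row/col i becomes the mean of rows/cols i and j
  d[i, ] <- colMeans(d[c(i, j), ], na.rm = TRUE)
  d[, i] <- d[i, ]
  diag(d) <- NA
  d <- d[-j, -j, drop = FALSE]           # drop the merged-away cluster
}                                        # Step 5: loop until one cluster
merges                                   # three merges for four points
```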
Step 1. Each car is a cluster.
Step 2. Create a distance matrix from Auto_data_clust.
help("dist")
## Help on topic 'dist' was found in the following packages:
##
## Package Library
## factoextra /Users/taitea/Library/R/3.6/library
## stats /Library/Frameworks/R.framework/Versions/3.6/Resources/library
##
##
## Using the first match ...
hierarchical_dist <- as.matrix(dist(Auto_data_clust, method = "euclidean"))
#View(hierarchical_dist)
Step 3. Find the two cars that are the most similar to each other and print the names of those two cars
diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
## [,1] [,2]
## [1,] 23 15
#positions 23 and 15 are the most similar. Let's go back to the names of the cars
Auto_data_names[23]
## [1] "saab 99e"
Auto_data_names[15]
## [1] "toyota corona mark ii"
Step 4. Merge the two clusters together using average linkage.
#replace pos 15 with the average of pos 15 and 23
hierarchical_dist[,15] <- apply((hierarchical_dist[,c(23,15)]),1,mean)
hierarchical_dist[15,] <- apply((hierarchical_dist[c(23,15),]),2,mean)
#remove pos 23
hierarchical_dist <- hierarchical_dist[-23,-23]
#now position 15 represents the cluster containing the saab 99e and the toyota corona mark ii
Step 5. To complete the algorithm, go back to Step 3 and iterate through all of the previous steps until only one cluster remains.
diag(hierarchical_dist) <- NA
arrayInd(which.min(hierarchical_dist), dim(hierarchical_dist))
## [,1] [,2]
## [1,] 4 3
#positions 4 and 3 are the most similar
Auto_data_names[4]
## [1] "amc rebel sst"
Auto_data_names[3]
## [1] "plymouth satellite"
Now that we know how the algorithm works, let's use the R function hclust. Plot the dendrogram resulting from clustering Auto_data_clust using average linkage.
hierarchical_dist <- dist(Auto_data_clust, method = "euclidean")
tree <- hclust(hierarchical_dist, method="average")
plot(tree)
There is one more element to hierarchical clustering: cutting the tree. Here, we control either the number of clusters or the height at which the tree is cut.
#help(cutree)
# cut tree into 3 clusters
tree <- hclust(hierarchical_dist, method="average")
plot(tree)
tree_k3 <- cutree(tree, k = 3)
# plot the tree before running this line
rect.hclust(tree, k = 3, h = NULL)
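As a self-contained aside on the built-in mtcars data (not the Auto data above), cutree() returns one integer cluster label per observation, which is convenient for tabulating cluster sizes:

```r
#cut an average-linkage tree of mtcars into 3 clusters
tr <- hclust(dist(mtcars), method = "average")
cl <- cutree(tr, k = 3)
table(cl)  #size of each of the 3 clusters
```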
Principal Components Analysis (PCA) is a linear dimensionality reduction algorithm. If you want to learn more about the underlying linear algebra, I suggest the MIT OpenCourseWare class here: https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/ There are two ways of doing PCA: Singular Value Decomposition (SVD), and the method we will use today, which works from the covariance matrix of the data.
Step 1. Center data by subtracting the mean.
Step 2. Calculate covariance matrix of data.
Step 3. Perform an eigendecomposition of the covariance matrix, i.e. represent the matrix in terms of its eigenvalues and eigenvectors.
Step 4. Multiply the centered data by the eigenvectors to express the data in terms of the eigenvectors.
Step 1. Center the data by subtracting the mean of each column from the values in that column
Auto_data_clust_pca <- data.matrix(Auto_data_clust)
Center_auto <- apply(Auto_data_clust_pca, 2, function(x) x - mean(x))
Step 2. Calculate covariance matrix of the Auto data
Covariance_auto <- cov(Center_auto)
Step 3. Calculate eigen values and vectors
Eigen_value_auto <- eigen(Covariance_auto)$values
#columns are the eigenvectors
Eigen_vector_auto <- eigen(Covariance_auto)$vectors
Step 4. Multiply the centered data by the eigenvector matrix.
PC <- as.data.frame(data.matrix(Center_auto) %*% Eigen_vector_auto)
ggplot(PC, aes(PC[,1], PC[,2])) + geom_point()
#+ geom_text(aes(label=Auto_data_names[1:25]), nudge_x = -2.5, nudge_y = 400)
Step 5. Find out which principal components explain the variance in the data.
#for each component, take the cumulative sum of eigenvalues up to that point and divide by the total sum of eigenvalues
round(cumsum(Eigen_value_auto)/sum(Eigen_value_auto) * 100, digits = 2)
## [1] 99.52 99.95 100.00 100.00 100.00 100.00 100.00 100.00
Principal component 1 alone explains 99.52 percent of the variance; components 1 and 2 explain 99.95 percent; and components 1, 2, and 3 together explain 100 percent of the variance in the data (to two decimal places).
Now that we know how PCA works, let's use the R function prcomp.
help("prcomp")
autoplot(prcomp(Auto_data_clust_pca))
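As a sanity check that the manual covariance-eigendecomposition route agrees with prcomp, here is a self-contained comparison on synthetic data (the scores match up to the arbitrary sign of each eigenvector):

```r
set.seed(1)
X  <- matrix(rnorm(50 * 4), 50, 4)
Xc <- scale(X, center = TRUE, scale = FALSE)  #Step 1: center
e  <- eigen(cov(Xc))                          #Steps 2-3: covariance + eigendecomposition
scores <- Xc %*% e$vectors                    #Step 4: project onto eigenvectors
pr <- prcomp(X, center = TRUE, scale. = FALSE)
max(abs(abs(scores) - abs(pr$x)))             #essentially zero
all.equal(e$values, pr$sdev^2)                #eigenvalues are the PC variances
```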
ICA is an algorithm that decomposes the data into components (subcomponents) that are statistically independent of each other.
Step 1. Whiten the data by projecting the data onto the eigen vectors (PCA).
Step 2. Solve the X = AS equation by maximizing the non-gaussianity of the variables (components) in S.
This results in a matrix S with components that are independent of each other.
We will use the fastICA algorithm.
First, we will work backwards: create a matrix S containing known independent components.
#create two signals
S <- cbind(cos((1:500)/10), ((500:1)/1000))
par(mfcol = c(1, 2))
plot(S[,1], type="l")
plot(S[,2], type="l")
Create a mixing matrix A
A <- matrix(c(0.5, 0.7, 0.423, 0.857), 2, 2)
Mix S using A
X <- S %*% A
par(mfcol = c(1, 2))
plot(X[,1], type="l")
plot(X[,2], type="l")
Unmix using fastICA (this requires the fastICA package)
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
             method = "R", row.norm = FALSE, maxit = 200,
             tol = 0.0001, verbose = TRUE)
par(mfcol = c(1, 2))
plot(1:500, a$S[,1], type = "l", xlab = "S'1", ylab = "")
plot(1:500, a$S[,2], type = "l", xlab = "S'2", ylab = "")
Plot the independent components as a heatmap
heatmap(a$S)
data(iris)
The iris data contains four numeric variables: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
#subsetting by removing the extra (Species) variable and saving as a new dataset
iris_mod <- iris[c(-5)]
#saving species names
species_names <- iris$Species
#converting data from list to matrix
iris_mod_matrix <- as.matrix(iris_mod)
#randomly selecting 3 rows of the data as the initial cluster centers
centers <- iris_mod_matrix[sample(nrow(iris_mod_matrix), 3),]
#defining euclidean distance to be used in kmeans calculation
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}
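A quick way to gain confidence in the euclid helper above is to compare it against base R's dist on random data (the function is repeated here so the snippet runs standalone):

```r
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}
set.seed(2)
m <- matrix(rnorm(20), 5, 4)
all.equal(euclid(m, m), as.matrix(dist(m)), check.attributes = FALSE)  #TRUE
```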
#defining kmeans clustering as a function
K_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
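Before applying K_means to iris, here is a self-contained toy run on two well-separated hypothetical blobs, where the answer is known in advance (both helper functions are repeated so the snippet runs standalone):

```r
euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}
K_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
set.seed(4)
#20 points around (0, 0) and 20 points around (10, 10)
blob <- rbind(matrix(rnorm(40, mean = 0), 20, 2),
              matrix(rnorm(40, mean = 10), 20, 2))
#seed with one point from each blob, run 5 iterations
res <- K_means(blob, blob[c(1, 40), ], euclid, 5)
table(res$clusters[[5]])  #two clusters of 20 points each
```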
#calling function for 2 iterations on data matrix and printing results
iris_cluster <- K_means(iris_mod_matrix, centers, euclid, 2)
iris_cluster
## $clusters
## $clusters[[1]]
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 1 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1
## [71] 3 1 3 1 1 3 3 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [106] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
##
## $clusters[[2]]
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 3 1 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1
## [71] 3 1 3 1 1 1 3 3 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [106] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
##
##
## $centers
## $centers[[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.681579 2.681579 4.113158 1.286842
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.617742 2.988710 5.391935 1.914516
##
## $centers[[2]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.729268 2.690244 4.151220 1.300000
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.632203 2.998305 5.430508 1.937288
#Step 1. Center data by subtracting the mean.
iris_data_clust_pca <- data.matrix(iris_mod)
center_iris <- apply(iris_data_clust_pca, 2, function(x) x - mean(x))
#Step 2. Calculate covariance matrix of data.
covariance_iris <- cov(center_iris)
#Step 3. Perform Eigendecomposition of the covariance matrix
eigen_value_iris <- eigen(covariance_iris)$values
eigen_vector_iris <- eigen(covariance_iris)$vectors
#Step 4. Multiply the eigen vectors by the original data to express the data in terms of the eigen vectors.
PC <- as.data.frame(data.matrix(center_iris) %*% eigen_vector_iris)
#Step 5. Find out which principal components explain the variance in the data.
#plot PC1 and PC2 in scatterplot
ggplot(PC, aes(PC[,1], PC[,2])) + geom_point()
#calculate variance explained
round(cumsum(eigen_value_iris)/sum(eigen_value_iris) * 100, digits = 2)
## [1] 92.46 97.77 99.48 100.00
PC1 explains 92.46 percent of the variance. PC1 and PC2 explain 97.77 percent of the variance.
#create a matrix S with the independent components
S <- cbind(cos((1:500)/10), ((500:1)/1000))
par(mfcol = c(1, 2))
plot(S[,1], type="l")
plot(S[,2], type="l")
#create a mixing matrix A
A <- matrix(c(0.5, 0.7, 0.423, 0.857), 2, 2)
#mix S using A
X <- S %*% A
par(mfcol = c(1, 2))
plot(X[,1], type="l")
plot(X[,2], type="l")
#unmix using FastICA
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
#ICA on the iris data (n.comp cannot exceed the number of variables, so use 4)
iris_ica <- fastICA(iris_mod, 4, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                    method = "R", row.norm = FALSE, maxit = 200,
                    tol = 0.0001, verbose = TRUE)
#heatmap of iris ICA
heatmap(iris_ica$S)
library(cluster)
library(purrr)
#generating silhouette plots for a variety of cluster numbers
#(note: pam() coerces the Species factor to numeric codes here; strictly,
#iris_mod should be passed so the label is not part of the clustering)
iris_k2 <- pam(iris, k=2)
plot(silhouette(iris_k2))
iris_k3 <- pam(iris, k=3)
plot(silhouette(iris_k3))
iris_k4 <- pam(iris, k=4)
plot(silhouette(iris_k4))
#calculating highest average silhouette width
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = iris, k = k)
  model$silinfo$avg.width
})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
print(sil_df)
## k sil_width
## 1 2 0.6825351
## 2 3 0.5728779
## 3 4 0.5036421
## 4 5 0.4948241
## 5 6 0.5074318
## 6 7 0.3725349
## 7 8 0.3440288
## 8 9 0.3040292
## 9 10 0.3052771
#kmeans clustering
set.seed(20)
clusters <- kmeans(iris_mod, 2)
iris$clusters <- as.factor(clusters$cluster)
str(clusters)
## List of 9
## $ cluster : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
## $ centers : num [1:2, 1:4] 5.01 6.3 3.37 2.89 1.56 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ totss : num 681
## $ withinss : num [1:2] 28.6 123.8
## $ tot.withinss: num 152
## $ betweenss : num 529
## $ size : int [1:2] 53 97
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
#PCA plot by cluster
set.seed(20)
autoplot(kmeans(iris_mod, 2), data = iris_mod)
The silhouette analysis indicated that the optimal number of clusters is 2, but the dataset contains 3 species, so the best-scoring clustering does not separate the data by species.
#hierarchical clustering with average linkage and euclidean distance metric
iris_hierarchical_dist_eu <- dist(iris_mod, method = "euclidean")
tree_1 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_1)
#hierarchical clustering with average linkage and maximum distance metric
iris_hierarchical_dist_max <- dist(iris_mod, method = "maximum")
tree_2 <- hclust(iris_hierarchical_dist_max, method="average")
plot(tree_2)
#hierarchical clustering with single linkage and euclidean distance metric
tree_3 <- hclust(iris_hierarchical_dist_eu, method="single")
plot(tree_3)
#hierarchical clustering with single linkage and maximum distance metric
tree_4 <- hclust(iris_hierarchical_dist_max, method="single")
plot(tree_4)
#hierarchical clustering with average linkage and euclidean distance metric, cut at two different points (k = 3 and k = 4)
tree_5 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_5)
tree_k3 <- cutree(tree_5, k = 3)
rect.hclust(tree_5, k = 3, h = NULL)
tree_6 <- hclust(iris_hierarchical_dist_eu, method="average")
plot(tree_6)
tree_k4 <- cutree(tree_6, k = 4)
rect.hclust(tree_6, k = 4, h = NULL)
#plots of PCA colored by hierarchical clusters computed above
iris_tree_1 <- hcut(iris_hierarchical_dist_eu, hc_method = "average")
fviz_cluster(iris_tree_1, ellipse.type = "convex", data = iris_mod)
iris_tree_2 <- hcut(iris_hierarchical_dist_max, hc_method = "average")
fviz_cluster(iris_tree_2, ellipse.type = "convex", data = iris_mod)
iris_tree_3 <- hcut(iris_hierarchical_dist_eu, hc_method = "single")
fviz_cluster(iris_tree_3, ellipse.type = "convex", data = iris_mod)
iris_tree_4 <- hcut(iris_hierarchical_dist_max, hc_method = "single")
fviz_cluster(iris_tree_4, ellipse.type = "convex", data = iris_mod)
iris_tree_5 <- hcut(iris_hierarchical_dist_eu, k = 3, hc_method = "average")
fviz_cluster(iris_tree_5, ellipse.type = "convex", data = iris_mod)
iris_tree_6 <- hcut(iris_hierarchical_dist_eu, k = 4, hc_method = "average")
fviz_cluster(iris_tree_6, ellipse.type = "convex", data = iris_mod)
On PCA:
Eigenvalues and Eigenvectors: http://www.visiondummy.com/2014/03/eigenvalues-eigenvectors/
Linear Algebra by Prof. Gilbert Strang: https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/
A Tutorial on Principal Components Analysis: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Making sense of principal component analysis: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
On ICA:
Independent Component Analysis: Algorithms and Applications: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
Tutorial on ICA: http://rstudio-pubs-static.s3.amazonaws.com/93614_be30df613b2a4707b3e5a1a62f631d19.html